A Weighted Combination of Speech with Text-based Models for Arabic Diacritization
نویسندگان
چکیده
The majority of studies on Arabic diacritization have employed textually inferred features alone. This paper proposes a novel approach, where the weighted combination of speech with a text-based model is used to allow linguistically-insensitive acoustic information to correct and complement the errors generated by the text model’s diacritic predictions. The acoustic model is based on Hidden Markov Models and the textual model on Conditional Random Fields. The combination brings significant reduction in error rates across all metrics, especially in case endings, which are the most difficult to predict. It gives results superior to those of conventional methods, with diacritic and word error rates of 1.5 and 4.9 inclusive of case endings, and 1.0 and 2.7 exclusive of them.
منابع مشابه
Maximum entropy modeling for diacritization of Arabic text
We propose a novel modeling framework for automatic diacritization of Arabic text. The framework is based on Markov modeling where each grapheme is modeled as a state emitting a diacritic (or none) from the diacritic space. This space is exactly defined using 13 diacritics and a null-diacritic and covers all the diacritics used in any Arabic text. The state emission probabilities are estimated ...
متن کاملExploiting Arabic Diacritization for High Quality Automatic Annotation
We present a novel technique for Arabic morphological annotation. The technique utilizes diacritization to produce morphological annotations of quality comparable to human annotators. Although Arabic text is generally written without diacritics, diacritization is already available for large corpora of Arabic text in several genres. Furthermore, diacritization can be generated at a low cost for ...
متن کاملOff-line Arabic Handwritten Recognition Using a Novel Hybrid HMM-DNN Model
In order to facilitate the entry of data into the computer and its digitalization, automatic recognition of printed texts and manuscripts is one of the considerable aid to many applications. Research on automatic document recognition started decades ago with the recognition of isolated digits and letters, and today, due to advancements in machine learning methods, efforts are being made to iden...
متن کاملMADA+TOKAN: A Toolkit for Arabic Tokenization, Diacritization, Morphological Disambiguation, POS Tagging, Stemming and Lemmatization
We describe the MADA+TOKAN toolkit, a versatile and freely available system that can derive extensive morphological and contextual information from raw Arabic text, and then use this information for a multitude of crucial NLP tasks. Applications include high-accuracy part-of-speech tagging, diacritization, lemmatization, disambiguation, stemming, and glossing. MADA operates by examining a list ...
متن کاملArabic Diacritization with Recurrent Neural Networks
Arabic, Hebrew, and similar languages are typically written without diacritics, leading to ambiguity and posing a major challenge for core language processing tasks like speech recognition. Previous approaches to automatic diacritization employed a variety of machine learning techniques. However, they typically rely on existing tools like morphological analyzers and therefore cannot be easily e...
متن کامل